Exponentially Increasing the Capacity-to-Computation Ratio for Conditional Computation in Deep Learning
Abstract
Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase the ratio of the number of parameters to computation by up to an exponential factor. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.

1 Conditional Computation for Deep Nets

Deep learning is about learning hierarchically-organized representations, with higher levels corresponding to more abstract concepts automatically learned from data, in a supervised, unsupervised, or semi-supervised way, or via reinforcement learning (Mnih et al., 2013). See Bengio et al. (2013b) for a recent review. There have been a number of breakthroughs in the application of deep learning, e.g., in speech (Hinton et al., 2012a) and computer vision (Krizhevsky et al., 2012).
Most of these involve deep neural networks that have as much capacity (in the number of units and parameters) as possible, given the constraints on training and test time that made these experiments reasonably feasible. It has recently been reported that bigger models can yield better generalization on a number of datasets (Coates et al., 2011; Hinton et al., 2012b; Krizhevsky et al., 2012; Goodfellow et al., 2013), provided appropriate regularization such as dropout (Hinton et al., 2012b) is used. These experiments, however, have generally been limited by training time, which in turn limits the amount of training data that can be exploited. An important factor in these recent breakthroughs has been the availability of GPUs, which have allowed training deep nets at least 10 times faster, often more (Raina et al., 2009). However, whereas the task of recognizing handwritten digits, traffic signs (Ciresan et al., 2012) or faces (?) is solved to the point of achieving roughly human-level performance, this is far from true for other tasks.
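The tree-structured parametrization described in the abstract can be sketched in a few lines of NumPy. The following is a minimal illustration under assumed names and sizes (it is not the paper's implementation): each node of a binary tree, identified by a prefix of the sign bits of the first few hidden units, owns its own weight matrix, and only the matrices along the single active root-to-leaf path are multiplied through for a given example.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not taken from the paper).
depth = 3           # number of gating sign bits, i.e. tree depth
n_in, n_out = 8, 4  # layer input/output dimensions

# One weight matrix per tree node, keyed by a bit-string prefix ("" = root).
# Total parameters grow as O(2^depth); per-example computation as O(depth).
prefixes = [""] + ["".join(b) for d in range(1, depth + 1)
                   for b in itertools.product("01", repeat=d)]
W = {p: rng.standard_normal((n_in, n_out)) * 0.1 for p in prefixes}

def gated_layer(h):
    """Sum the contributions of the weight matrices along the active path,
    selected by the sign bits of the first `depth` hidden units."""
    bits = "".join("1" if h[i] > 0 else "0" for i in range(depth))
    out = np.zeros(n_out)
    for d in range(depth + 1):  # root plus one node per prefix length
        out += h @ W[bits[:d]]
    return out

h = rng.standard_normal(n_in)
y = gated_layer(h)  # touches depth+1 of the 2^(depth+1) - 1 stored matrices
```

With `depth = 3` the layer stores 15 weight matrices but multiplies through only 4 of them per example; increasing `depth` by one doubles the parameter count while adding just one extra matrix product, which is the exponential capacity-to-computation ratio the note argues for.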
Related articles
Learning Curve Consideration in Makespan Computation Using Artificial Neural Network Approach
This paper presents an alternative method using an artificial neural network (ANN) to develop a scheduling scheme used to determine the makespan or cycle time of a group of jobs going through a series of stages or workstations. The common conventional method uses mathematical programming techniques and presents the results in Gantt chart form. The contribution of this paper is threefold. First...
Conditional Computation in Neural Networks for faster models
Deep learning has become the state-of-the-art tool in many applications, but the evaluation and training of deep models can be time-consuming and computationally expensive. The conditional computation approach has been proposed to tackle this problem (Bengio et al., 2013; Davis & Arel, 2013). It operates by selectively activating only parts of the network at a time. In this paper, we use reinforcem...
Parallel computation framework for optimizing trailer routes in bulk transportation
We consider a rich tanker trailer routing problem with stochastic transit times for chemicals and liquid bulk orders. A typical route of the tanker trailer comprises sourcing a cleaned and prepped trailer from a pre-wash location, pickup and delivery of chemical orders, cleaning the tanker trailer at a post-wash location after order delivery, and prepping for the next order. Unlike traditiona...
Fast Finite Element Method Using Multi-Step Mesh Process
This paper introduces a new method for accelerating the currently sluggish FEM and improving memory demand in FEM problems with high node resolution or bulky structures. Like most numerical methods, FEM results in a matrix equation, which normally has huge dimension. By breaking the main matrix equation into several smaller matrices, the solving procedure can be accelerated. For implementing ...
Low-Rank Approximations for Conditional Feedforward Computation in Deep Neural Networks
Scalability properties of deep neural networks raise key research questions, particularly as the problems considered become larger and more challenging. This paper expands on the idea of conditional computation introduced in [2], where the nodes of a deep network are augmented by a set of gating units that determine when a node should be calculated. By factorizing the weight matrix into a low-r...
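The low-rank gating idea in the snippet above can be sketched as follows. This is a hypothetical NumPy illustration under assumed shapes and names (including the gating projection `G`), not the cited paper's actual factorization: the weight matrix is factored as `U @ V`, and per-example binary gates select which rank-1 components are actually computed.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative sizes (assumptions for the sketch).
n_in, n_out, r = 16, 8, 6
U = rng.standard_normal((n_in, r)) * 0.1   # left low-rank factor
V = rng.standard_normal((r, n_out)) * 0.1  # right low-rank factor
G = rng.standard_normal((n_in, r)) * 0.1   # gating projection (hypothetical)

def gated_lowrank(x):
    """Compute x @ U @ V using only the rank-1 components whose gate is open,
    so per-example cost scales with the number of open gates, not with r."""
    gates = (x @ G) > 0                # one binary gate per component
    active = np.flatnonzero(gates)     # indices of the open components
    return (x @ U[:, active]) @ V[active, :]

x = rng.standard_normal(n_in)
y = gated_lowrank(x)
```

Selecting only the active columns of `U` and rows of `V` gives the same result as masking the full factorization, while the arithmetic performed is proportional to the number of open gates.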
Journal: CoRR
Volume: abs/1406.7362
Pages: -
Publication date: 2014